Abstract

Appropriate data distribution has been found to be critical for obtaining good per- 
formance on Distributed Memory Multicomputers like the CM-5, Intel Paragon and 
IBM SP-1. It has also been found that some programs need to change their distri- 
butions during execution for better performance (redistribution). This work focuses 
on automatically generating efficient routines for redistribution. We present a new 
mathematical representation for regular distributions called PITFALLS and then dis- 
cuss algorithms for redistribution based on this representation. One of the significant 
contributions of this work is being able to handle arbitrary source and target processor 
sets while performing redistribution. Another important contribution is the ability 
to handle an arbitrary number of dimensions for the array involved in the redistri- 
bution in a scalable manner. Our implementation of these techniques is based on an 
MPI-like communication library. The results presented show the low overheads for our 
redistribution algorithm as compared to naive runtime methods. 


*This research was supported in part by the Office of Naval Research under Contract N00014-91-J-1096,
and in part by the National Aeronautics and Space Administration under Contract NASA NAG 1-613.




1 Introduction 

1.1 Motivation for Data Redistribution 

Distributed Memory Multicomputers such as the Intel Paragon, IBM SP-1 and the Connection
Machine CM-5 offer significant advantages over shared memory multiprocessors in terms
of cost and scalability. Unfortunately, to extract all that computational power from these 
machines, users have to write efficient software for them, which is an extremely laborious 
process. The PARADIGM compiler project at the University of Illinois is aimed at automat- 
ically generating a parallel FORTRAN program for any Distributed Memory Multicomputer 
given an input FORTRAN77 program. The fully implemented PARADIGM compiler will 
automatically: 

• Determine a good data partitioning scheme for the input program [1, 2, 3, 4] 

• Use the data partition determined (or user provided data distribution directives) to 
partition computation between the processors of the system and generate the required 
communication routines [5, 6, 7] 

• Detect available functional and data parallelism and use this information to make 
program execution efficient [8, 9] 

• Provide compiler and runtime support for irregular computations [10] 

One of the major aspects of programming/compiling for Distributed Memory Multi- 
computers has been data distribution. A good distribution of data can eliminate a lot of 
unnecessary communication and thus provide good speedups. There have been many efforts 
on developing automatic data partitioning techniques [1, 3, 11, 12, 13]. In addition, user pro- 
vided constructs have been proposed in some form or the other in every FORTRAN dialect 
for Multicomputers including FORTRAN D and HPF [14, 15]. Recently, the HPF standard 
has been widely adopted in industry and academia for specifying data distributions. HPF 
also provides directives for data redistribution dynamically during program execution. In 
this work, we will consider only "regular" distributions along each array dimension, i.e., one
of ALL, BLOCK, CYCLIC or BLOCKCYCLIC(x). What we mean by these terms is
made clearer in Figure 1. Considering only these distributions is reasonable because a large 
number of scientific programs have been found to use such distributions. 

In many programs, the distribution of an array needs to be changed for different phases of 
a program in order to achieve good performance. For instance, a 2D FFT routine comprises
a sequence of two 1D FFT operations on the input matrix. First, 1D FFTs are computed
along each row; then 1D FFTs are computed along each column. The best
distribution for the first phase would be BLOCK or CYCLIC along the row dimension. On
the other hand, the second phase would be best performed using a BLOCK or CYCLIC
distribution along the column dimension. It must be noted that these distributions would
result in zero communication within each phase but require a redistribution between phases.
This redistribution can be avoided by distributing the array along both dimensions; however,
this will cause considerable communication within each phase. The work in [13] shows that
the performance of a 2D FFT on an iPSC/860 is best when redistribution is used.

Figure 1: Examples of Regular Distributions

Redistribution of data is very critical for Multiple Program Multiple Data (MPMD) 
programming. In such programs different subsystems of a given processor system execute 
different parts of a program. This is in contrast to the popular Single Program Multiple Data 
type of programs where essentially all processors are executing the same program, but on 
different data sets. MPMD programs can potentially execute faster than SPMD programs 
by making the execution more efficient. Frequently, data dependence constraints in MPMD 
programs require arrays be redistributed from one subsystem to another. Figure 2 shows 
this clearly. Here, array A is being written into by the first set of processors and being 
read by the second set of processors. The PARADIGM compiler effort is among the first to
consider the problem of automatic MPMD program generation. This has been one of the
primary motivations for the work presented in this paper. More details on MPMD programs
can be found in [8, 9, 16, 17, 18, 19].

A[i] = ...

= A[i] ...

Figure 2: Need for Redistribution in MPMD Programs

1.2 The Data Redistribution Problem 

Having motivated the need for redistribution, we can now formally define a redistribution R
to be the set of routines that, given an n-dimensional array A on a set of source processors $P_s$
with source distribution $D_s$, transfer all the elements of the array to a set of target processors
$P_t$ with a target distribution $D_t$. In the general case, $D_s$ and $D_t$ can specify arbitrary data
distributions along each dimension of the array. However, as mentioned before, we only
handle regular distributions at this point. A redistribution routine therefore needs to determine
exactly what data needs to be sent (received) by each source (target) processor. It is
possible to use a simple runtime resolution approach for redistribution. In this approach,
each source processor computes the index of each of the elements it owns based on the
source distribution, uses this index to compute the target processor for that element based on the
target distribution, packs the element into a buffer meant for that processor, and finally sends the
contents of its buffers to the target processors. The target processors essentially do the reverse.
However, as we shall see later, this approach is very costly compared to a method like ours 
which makes use of the distribution information in a more intelligent manner. 
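
As a point of reference, the runtime resolution approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `owner_block_cyclic` and `pack_for_targets` are hypothetical, BLOCK is treated as BLOCKCYCLIC with a block size of ceil(n/p), and the actual send step is omitted.

```python
from collections import defaultdict

def owner_block_cyclic(index, block, nprocs):
    """Processor that owns a global index under BLOCKCYCLIC(block)."""
    return (index // block) % nprocs

def pack_for_targets(local_indices, local_values, tgt_block, tgt_nprocs):
    """Naive runtime resolution on one source processor (hypothetical helper):
    visit every owned element, look up its target owner, and append the
    element to the buffer meant for that target."""
    buffers = defaultdict(list)
    for idx, val in zip(local_indices, local_values):
        target = owner_block_cyclic(idx, tgt_block, tgt_nprocs)
        buffers[target].append((idx, val))
    return dict(buffers)

# Source processor 0 of 4 under BLOCK on a 32-element array owns indices 0..7.
# The target distribution is BLOCKCYCLIC(2) on 8 processors.
bufs = pack_for_targets(range(8), [10 * i for i in range(8)],
                        tgt_block=2, tgt_nprocs=8)
```

Every element is examined individually, which is the per-element cost that a representation-based method is meant to avoid.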

The important features of our redistribution method are: 

• Redistribution routines are to be automatically generated at compile time. This work 
will be part of the PARADIGM compiler support for MPMD program generation. 

• The source and target processor sets can be any arbitrary subset of the given processor
system. The motivation for this comes from MPMD programs. For such programs
we must be able to handle a redistribution, for example, from processors 0, 3, 1, 6 to
processors 1, 2. We use ideas similar to those used in MPI [20] to handle arbitrary
processor sets.

• Arrays being redistributed can have any of the possible regular distributions on the
source and target processor sets. Note that one of the most general cases of redistribution
is BLOCKCYCLIC(m) to BLOCKCYCLIC(n), where m and n are relatively
prime. We handle all the types of distributions in a uniform manner.

• Arrays being redistributed can have an arbitrary number of dimensions. The complex- 
ity of our algorithms scales linearly with an increase in the number of dimensions. 

• We handle multiple arrays being redistributed at the same time using message aggre- 
gation [5, 6, 14]. This means we have just one send-recv between a pair of processors 
even if there is data from more than one array being communicated between them. 

Our redistribution techniques rely on a mathematical representation for regular distri- 
butions which we call PITFALLS (Processor Index Tagged FAmiLy of Line Segments). Al- 
though many representations exist for regular data distributions on distributed memory 
machines [14, 21, 22, 6], we felt the need for a new representation for our work because 
none of the previous representations satisfied our requirements. A primary requirement for
us was to be able to mathematically represent a regular distribution on any given subset of 
processors of the given system and not on the entire system as all the current representations 
do. In addition, we needed to be able to perform redistribution communication analysis ef- 
ficiently. We therefore developed the PITFALLS representation along with a redistribution 
algorithm based on it. This is discussed in the next section. We have also included a brief 
explanation of our implementation and provided the results of a comparison of our method 
with a runtime resolution type of method in Section 3. Finally, we discuss the implications 
of our work and future extensions. 

1.3 Related Work 

It is possible for one to argue that redistribution can be performed using multicomputer 
compiler techniques such as those outlined in [5, 6, 23, 14, 21, 22]. These techniques gen- 
erate the communication required for any program statement to execute correctly given the 
distributions of the arrays involved in the statement. One could now use a statement of the
form B = A where B is distributed according to the target distribution and A is distributed
according to the source distribution. However, it is not clear that any of the currently implemented
compilers have efficient techniques to handle all the regular distributions possible
for A and B. Another major obstacle in trying to use current compiler techniques is that
they cannot handle the given statement if A and B are distributed on distinct (possibly
overlapping) sets of processors. On the other hand, the possibility of handling arbitrary program
statements using techniques similar to the ones used in this work is being considered
for the PARADIGM compiler. The reason for this is the simplicity and practicality of our
techniques (the algorithms given in this paper have all been implemented).

The work by Agarwal et al. [24] provides runtime support for redistributions. They
construct schedules for redistribution at runtime and reuse schedules if a particular redistribution
pattern occurs more than once. However, this work neither handles the CYCLIC
or BLOCKCYCLIC type of distributions nor considers arbitrary source and target
processor sets.

Recent work by Thakur et al. [25] considers redistributions of regular arrays in detail.
This work is implemented in the form of a library for an HPF compiler. The methods proposed
treat possible source-target distributions in a pairwise manner. Their general approach
is to use a runtime resolution approach such as the one described before, although for
specific cases of source-target data distributions they use efficient methods. For the multi-dimensional
case, [25] proposes a solution which is very expensive: such redistributions
are considered to be composed of a series of one-dimensional redistributions. Figure 3
illustrates the difference between their approach and ours. In this example, the basic problem is
the redistribution of a two-dimensional array A from (BLOCK, ALL) to (ALL, BLOCK).
The approach of [25] would be to carry it out as a set of two redistributions: first, from
(BLOCK, ALL) to (ALL, ALL) and then, from (ALL, ALL) to (ALL, BLOCK). On the
other hand, our approach would be to go directly from (BLOCK, ALL) to (ALL, BLOCK)
as shown.

2 The PITFALLS Representation and Redistribution 

Broadly, to perform a redistribution, for any source-target processor pair, one has to look at 
the set of elements owned by the source processor before redistribution (based on the source 
distribution) and the set of elements owned by the target processor after redistribution 
(based on the target distribution). The intersection of these sets is the data that needs to be
transferred between the pair of source-target processors. The PITFALLS representation is
particularly useful in this context as it can be used to easily determine the two things most
important for redistribution: which pairs of source-target processors need to communicate,
and the intersection of the data sets of such a pair of source-target processors.

Figure 3: Need for Redistribution in MPMD Programs

For simplicity, we first develop the PITFALLS representation for regular distributions of 
a one-dimensional array in a step-by-step manner. We will later extend our ideas to multiple
dimensions. 


2.1 Line Segments (LS) 

Consider a one-dimensional array A of size n. Fundamental to the PITFALLS representation
is the idea of using Line Segments (LS) to represent a contiguous block of elements. An LS
$L$ can be represented by a pair of numbers $(l, r)$. For our representation, this LS (in the
context of array A) is taken to mean the block of elements of A with indices starting at
$l$ and ending at $r$ (numbering $r - l + 1$). We call the quantity $l$ the LOW of $L$; i.e.,
$l = LOW(L)$. Similarly, $r$ is called the HIGH of $L$ ($r = HIGH(L)$). Note that a single
element with index $l$ has the LS representation $(l, l)$.

F = (0, 0, 6, 6)

Figure 4: Examples of FALLS

Since our primary interest is in being able to find intersections for sets of elements, we
see that the intersection of two LS's $L_1 = (l_1, r_1)$ and $L_2 = (l_2, r_2)$ (denoted by $LI_{L_1 \cap L_2}$) is
given by:

$$LI_{L_1 \cap L_2} = \begin{cases} (\max(l_1, l_2),\; \min(r_1, r_2)) & \text{if } \max(l_1, l_2) \le \min(r_1, r_2) \\ \emptyset & \text{otherwise} \end{cases}$$

2.2 FAmiLy of Line Segments (FALLS) 

The LS notion can be extended to what we call a FAmiLy of Line Segments (FALLS). A
FALLS $F$ can be represented by a tuple $(l, r, str, num)$. Intuitively, $F$ represents a set of
$num$ equally spaced, equally sized blocks of elements; the first block starts at $l$ and ends at
$r$; the stride between successive $l$'s is $str$. Note that these are non-overlapping blocks. The
$i$th ($0 \le i \le num - 1$) LS of $F$ (denoted by $L^i$ and called the $i$th member) is given by:

$$L^i = (l + i \times str,\; r + i \times str)$$

Figure 4 shows a few examples of FALLS. 
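
The LS and FALLS definitions translate directly into code. The following Python sketch is illustrative only; the tuple layouts follow the definitions above, while the function names (`ls_intersect`, `falls_member`, `falls_members`) are assumptions made for this example.

```python
def ls_intersect(a, b):
    """Intersect two line segments given as inclusive (low, high) pairs.
    Returns None when the intersection is empty."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None

def falls_member(falls, i):
    """The i-th line segment of a FALLS given as (l, r, str, num)."""
    l, r, stride, num = falls
    if not 0 <= i < num:
        raise IndexError("member index out of range")
    return (l + i * stride, r + i * stride)

def falls_members(falls):
    """All line segments of a FALLS, in increasing order of index."""
    return [falls_member(falls, i) for i in range(falls[3])]
```

For instance, `falls_members((0, 0, 6, 6))` enumerates six single-element segments spaced six apart.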

Using the notion of FALLS, it is possible to represent the set of elements of A owned by
a particular processor under any regular distribution. Figure 5 shows the FALLS representation
for elements owned by processor 1 in a 4-processor system for various distributions of
A when n = 32. In this example, it turns out that in every case, processor 1's elements can
be represented using a single FALLS. This may not be true in the general case, where more
than one FALLS may be needed. However, it is easy to show that no more than two FALLS
are needed for any regular distribution. In Figure 6, we show an example of processor 2
needing two FALLS when an array of size 32 is distributed using BLOCKCYCLIC(3).
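
To make the "at most two FALLS" observation concrete, the sketch below derives the FALLS owned by one processor under BLOCKCYCLIC(b). It is an illustration under assumptions, not the paper's generator: the function names are hypothetical, BLOCK is modelled as BLOCKCYCLIC(ceil(n/p)) and CYCLIC as BLOCKCYCLIC(1), and a FALLS with a single member is given stride 1 by convention.

```python
def falls_for_block_cyclic(n, nprocs, block, p):
    """FALLS (l, r, str, num) owned by processor p for an n-element array
    distributed BLOCKCYCLIC(block) over nprocs processors."""
    stride = block * nprocs
    first = p * block
    if first >= n:
        return []
    # Number of blocks of p that start inside the array.
    started = (n - 1 - first) // stride + 1
    last_start = first + (started - 1) * stride
    # The last block may be cut short by the end of the array.
    full = started if last_start + block <= n else started - 1
    falls = []
    if full > 0:
        falls.append((first, first + block - 1, stride if full > 1 else 1, full))
    if full < started:
        falls.append((last_start, n - 1, 1, 1))  # trailing partial block
    return falls

def expand(falls):
    """All array indices covered by one FALLS."""
    l, r, stride, num = falls
    return [i for k in range(num) for i in range(l + k * stride, r + k * stride + 1)]
```

With n = 32, four processors and block size 3, processor 2 receives one FALLS for its full blocks and a second one for the truncated block at the end of the array.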

Figure 5: Examples of FALLS Describing Regular Distributions

BLOCKCYCLIC(3): (6, 8, 12, 2) ∪ (30, 31, 1, 1)

Figure 6: Examples of Distribution Requiring Multiple FALLS

Once again, in the context of redistribution, computing the intersection of two FALLS is
of interest to us. Given two FALLS $F_1 = (l_1, r_1, str_1, num_1)$ and $F_2 = (l_2, r_2, str_2, num_2)$, a
simple brute force algorithm to compute the intersection ($FI_{F_1 \cap F_2}$) is shown in Figure 7.
This brute force approach just considers every possible pair of members from the two FALLS
and applies the LS intersection algorithm to them. We can see that this technique can be
very inefficient by considering the example of FALLS intersection shown in Figure 8. In this
example, there are just 4 non-empty intersections whereas our brute force algorithm would
perform 16 iterations.
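
The brute force procedure can be sketched as follows. The code is a plain Python illustration with hypothetical function names; the two FALLS in the usage note are chosen to be consistent with the example discussed in the text (stride 8 and four members each), which is an assumption about the figure rather than a transcription of it.

```python
def ls_intersect(a, b):
    """Intersection of two inclusive (low, high) segments, or None if empty."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None

def falls_intersect_brute(f1, f2):
    """Test every pair of members of two FALLS (l, r, str, num) and collect
    the non-empty line-segment intersections."""
    l1, r1, s1, n1 = f1
    l2, r2, s2, n2 = f2
    result = []
    for i1 in range(n1):
        seg1 = (l1 + i1 * s1, r1 + i1 * s1)
        for i2 in range(n2):
            seg2 = (l2 + i2 * s2, r2 + i2 * s2)
            seg = ls_intersect(seg1, seg2)
            if seg is not None:
                result.append(seg)
    return result
```

Calling `falls_intersect_brute((0, 3, 8, 4), (2, 3, 8, 4))` visits all 16 member pairs to find four segments, which is the inefficiency the period-based algorithms address.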

There are a couple of important observations to make in our example of Figure 8. 

for i_1 = 0, num_1 - 1
    L_1 = (l_1 + i_1 × str_1, r_1 + i_1 × str_1)
    for i_2 = 0, num_2 - 1
        L_2 = (l_2 + i_2 × str_2, r_2 + i_2 × str_2)
        FI_{F_1 ∩ F_2} = FI_{F_1 ∩ F_2} ∪ LI_{L_1 ∩ L_2}

Figure 7: Brute Force FALLS Intersection Algorithm
F_1 = (0, 3, 8, 4)
F_2 = (2, 3, 8, 4)
Intersection = (2, 3, 8, 4)

Figure 8: Example of FALLS Intersection

One of them is that the intersecting pairs of members of the two FALLS (circled in the 
figure) have the same relationship between them; i.e., their relative displacement is the 
same. This gives rise to the idea of periodicity in the relationships between members of 
the two FALLS. The length of the intersection period ($FP_{F_1 \cap F_2}$) for a given pair of FALLS
$F_1 = (l_1, r_1, str_1, num_1)$ and $F_2 = (l_2, r_2, str_2, num_2)$ can be written down as:

$$FP_{F_1 \cap F_2} = \operatorname{lcm}(str_1, str_2)$$

We also find it convenient to define a pair of quantities called $m_1$ and $m_2$ as follows:

$$m_1 = \frac{FP_{F_1 \cap F_2}}{str_1}, \qquad m_2 = \frac{FP_{F_1 \cap F_2}}{str_2}$$

Intuitively, these quantities represent the number of members of each FALLS occurring in a
period. It can easily be verified that a pair of members from the two FALLS $(i_1, i_2)$ will have
the same relative displacement as the pair of members $(i_1 + m_1, i_2 + m_2)$. For the example
of Figure 8, $FP_{F_1 \cap F_2} = 8$, $m_1 = 1$, and $m_2 = 1$.
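
The period and the two member counts are simple to compute. The helper below is a small sketch (the name `intersection_period` is hypothetical) that returns the period together with the number of members of each FALLS per period.

```python
from math import gcd

def intersection_period(f1, f2):
    """Return (FP, m1, m2) for two FALLS given as (l, r, str, num):
    FP = lcm(str1, str2), m1 = FP / str1, m2 = FP / str2."""
    s1, s2 = f1[2], f2[2]
    fp = s1 * s2 // gcd(s1, s2)
    return fp, fp // s1, fp // s2
```

Two FALLS with equal strides of 8 give a period of 8 with one member each per period, while strides of 16 and 4 give a period of 16 with one and four members respectively.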

Another observation to make in Figure 8 is that the intersection of the two FALLS in 
this case turns out to be a FALLS (as noted in the figure). These observations imply that 
we need only look at possible intersections between pairs of members of the two FALLS that 
occur within a period and extend any intersection that may result to all other periods (thus 
giving rise to a FALLS structure). This gives us a more efficient intersection computation 
algorithm shown in Figure 9. For the algorithm, $(I_1, I_2)$ is the first pair of members of the
two FALLS that intersect; all other terms have the same meaning as explained above. If we
use this algorithm for the example of Figure 8, we see that $(I_1 = 0, I_2 = 0)$ is the first pair of
for i_1 = I_1, I_1 + m_1 - 1
    L_1 = (l_1 + i_1 × str_1, r_1 + i_1 × str_1)
    for i_2 = I_2, I_2 + m_2 - 1
        L_2 = (l_2 + i_2 × str_2, r_2 + i_2 × str_2)
        if (LI_{L_1 ∩ L_2} ≠ ∅)
            l = LOW(LI_{L_1 ∩ L_2})
            r = HIGH(LI_{L_1 ∩ L_2})
            str = FP_{F_1 ∩ F_2}
            num = min(⌊(num_1 - i_1 - 1) / m_1⌋, ⌊(num_2 - i_2 - 1) / m_2⌋) + 1
            FI_{F_1 ∩ F_2} = FI_{F_1 ∩ F_2} ∪ (l, r, str, num)

Figure 9: FALLS Intersection Algorithm Based on Periods
F_1 = (0, 3, 16, 2)
F_2 = (2, 3, 4, 8)
Intersection = (2, 3, 16, 2)

Figure 10: Another Example of FALLS Intersection

members of the two families that intersect. As seen earlier, $m_1 = m_2 = 1$ and $FP_{F_1 \cap F_2} = 8$.
This means our algorithm will iterate just once with $L_1 = (0, 3)$ and $L_2 = (2, 3)$, which gives
us a non-empty intersection $LI_{L_1 \cap L_2} = (2, 3)$, making $l = 2$ and $r = 3$. Using the value
8 for $FP_{F_1 \cap F_2}$ gives us $str = 8$. Finally, $num$ can be calculated as 4, which gives us the
intersection FALLS as $FI_{F_1 \cap F_2} = (2, 3, 8, 4)$.

In the example just considered, we had only one resultant FALLS. This may not be the
case in general. The maximum number of FALLS that can be produced using this algorithm
can be shown to be $m_1 + m_2$. However, this is not of much concern since $m_1$ and $m_2$ are
very small in most situations.
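
The way periodicity turns individual segment intersections into FALLS can be illustrated with the sketch below. It is not the loop structure of Figure 9: instead of starting from the first intersecting pair, it scans all member pairs and keeps only those that are not a whole-period shift of an earlier pair, which keeps the example short and easy to check. The function name is hypothetical.

```python
from math import gcd

def falls_intersect_grouped(f1, f2):
    """Group the pairwise intersections of two FALLS (l, r, str, num) into
    FALLS whose stride is the intersection period lcm(str1, str2)."""
    l1, r1, s1, n1 = f1
    l2, r2, s2, n2 = f2
    fp = s1 * s2 // gcd(s1, s2)
    m1, m2 = fp // s1, fp // s2
    result = []
    for i1 in range(n1):
        for i2 in range(n2):
            if i1 >= m1 and i2 >= m2:
                continue  # same displacement as the pair (i1 - m1, i2 - m2)
            low = max(l1 + i1 * s1, l2 + i2 * s2)
            high = min(r1 + i1 * s1, r2 + i2 * s2)
            if low <= high:
                # Number of whole-period repetitions that stay inside both FALLS.
                num = min((n1 - i1 - 1) // m1, (n2 - i2 - 1) // m2) + 1
                result.append((low, high, fp, num))
    return result
```

The count of repetitions uses the same expression for `num` as the period-based algorithm; the sketch still tests more pairs than necessary, which is what the pruning discussed next removes.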

Although the algorithm outlined above substantially cuts down on the number of iterations
performed as compared to the brute force algorithm described first, it is still not very
efficient. This can be seen by considering the example of Figure 10. For this example,
$m_1 = 1$ and $m_2 = 4$, which means our algorithm iterates 4 times. However, we see that the
intersection consists of just one FALLS, which means 3 of the iterations produce no FALLS
and are thus wasted. This gives us the possibility of constructing another algorithm which
still looks at pairs of members within a period but does not consider pairs that do not
intersect. This pruning is done by considering the intersection of a pair of LS's $L_1 = (l_1, r_1)$
and $L_2 = (l_2, r_2)$. We see that this intersection is non-empty when $\max(l_1, l_2) \le \min(r_1, r_2)$.
This can happen under one of four conditions as listed below:


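$$\begin{array}{lll}
l_2 \le l_1 & r_1 \le r_2 & \\
l_2 \le l_1 & l_1 \le r_2 & r_2 \le r_1 \\
l_1 \le l_2 & r_2 \le r_1 & \\
l_1 \le l_2 & l_2 \le r_1 & r_1 \le r_2
\end{array}$$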

Applying these conditions to determine when a pair of members $(i_1, i_2)$ of two FALLS
$F_1 = (l_1, r_1, str_1, num_1)$ and $F_2 = (l_2, r_2, str_2, num_2)$ will intersect gives us:

$$\begin{array}{lll}
l_2 + i_2 \times str_2 \le l_1 + i_1 \times str_1 & r_1 + i_1 \times str_1 \le r_2 + i_2 \times str_2 & \\
l_2 + i_2 \times str_2 \le l_1 + i_1 \times str_1 & l_1 + i_1 \times str_1 \le r_2 + i_2 \times str_2 & r_2 + i_2 \times str_2 \le r_1 + i_1 \times str_1 \\
l_1 + i_1 \times str_1 \le l_2 + i_2 \times str_2 & r_2 + i_2 \times str_2 \le r_1 + i_1 \times str_1 & \\
l_1 + i_1 \times str_1 \le l_2 + i_2 \times str_2 & l_2 + i_2 \times str_2 \le r_1 + i_1 \times str_1 & r_1 + i_1 \times str_1 \le r_2 + i_2 \times str_2
\end{array}$$


By performing a more detailed analysis, we can reduce these conditions to the following:

$$i_2 \ge i_1 \times \frac{str_1}{str_2} + \frac{l_1 - r_2}{str_2}$$

$$i_2 \le i_1 \times \frac{str_1}{str_2} + \frac{r_1 - l_2}{str_2}$$


The equations above give us a method to determine which members of the two FALLS will
actually intersect. It can be seen that given a member of the first FALLS we can use these
conditions to determine the lower and upper bounds of members of the other FALLS (the $i_2$'s)
that will intersect with the given member. We can now construct an efficient intersection
algorithm as shown in Figure 11.

Note that we do not check for an empty $LI_{L_1 \cap L_2}$ because we are guaranteed it is non-empty
by iterating over the loop bounds computed using the conditions listed above. For
our example of Figure 10, we can see that our algorithm will iterate only once and produce
the FALLS $FI_{F_1 \cap F_2} = (2, 3, 16, 2)$.
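
A pruned intersection in the spirit of this discussion can be sketched as follows. This is an illustration under assumptions and not a transcription of Figure 11: it uses the two inequalities on $i_2$ (and their mirror image for $i_1$) to visit only intersecting pairs, and it restricts itself to pairs whose first index lies in the first $m_1$ members or whose second index lies in the first $m_2$ members, since every other intersecting pair is a whole-period shift of one of these. All names are hypothetical.

```python
from math import gcd

def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -((-a) // b)

def expand(falls):
    """All array indices covered by one FALLS (l, r, str, num)."""
    l, r, s, n = falls
    return [i for k in range(n) for i in range(l + k * s, r + k * s + 1)]

def falls_intersect(f1, f2):
    """Intersect two FALLS, visiting only member pairs that intersect."""
    l1, r1, s1, n1 = f1
    l2, r2, s2, n2 = f2
    fp = s1 * s2 // gcd(s1, s2)
    m1, m2 = fp // s1, fp // s2
    result = []

    def emit(i1, i2):
        low = max(l1 + i1 * s1, l2 + i2 * s2)
        high = min(r1 + i1 * s1, r2 + i2 * s2)
        num = min((n1 - i1 - 1) // m1, (n2 - i2 - 1) // m2) + 1
        result.append((low, high, fp, num))

    # Pairs with i1 in the first period of F1: bound i2 from both sides.
    for i1 in range(min(m1, n1)):
        lo = max(0, ceil_div(i1 * s1 + l1 - r2, s2))
        hi = min(n2 - 1, (i1 * s1 + r1 - l2) // s2)
        for i2 in range(lo, hi + 1):
            emit(i1, i2)
    # Remaining pairs with i2 in the first period of F2 (and i1 >= m1).
    for i2 in range(min(m2, n2)):
        lo = max(m1, ceil_div(i2 * s2 + l2 - r1, s1))
        hi = min(n1 - 1, (i2 * s2 + r2 - l1) // s1)
        for i1 in range(lo, hi + 1):
            emit(i1, i2)
    return result
```

The outer loops run at most $m_1 + m_2$ times in total, and no emptiness check is needed inside them because the bounds admit only intersecting pairs.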

As seen before, some regular distributions may result in a processor having a set of FALLS 
representing its elements rather than just one FALLS. Intersection of a set of multiple FALLS 





FI_{F_1 ∩ F_2} = ∅
I_1 = max(0, ⌈(l_2 - r_1) / str_1⌉)
for i_1 = I_1, min(I_1 + m_1 - 1, num_1 - 1)
    L_1 = (l_1 + i_1 × str_1, r_1 + i_1 × str_1)
    for i_2 = max(0, ⌈(i_1 × str_1 + l_1 - r_2) / str_2⌉), min(⌊(i_1 × str_1 + r_1 - l_2) / str_2⌋, m_2 - 1, num_2 - 1)
        L_2 = (l_2 + i_2 × str_2, r_2 + i_2 × str_2)
        l = LOW(LI_{L_1 ∩ L_2})
        r = HIGH(LI_{L_1 ∩ L_2})
        str = FP_{F_1 ∩ F_2}
        num = min(⌊(num_1 - i_1 - 1) / m_1⌋, ⌊(num_2 - i_2 - 1) / m_2⌋) + 1
        FI_{F_1 ∩ F_2} = FI_{F_1 ∩ F_2} ∪ (l, r, str, num)

Figure 11: Efficient FALLS Intersection Algorithm Based on Periods

with another set of multiple FALLS can be done by intersecting each possible pair of FALLS 
from the two sets (using the algorithm described above). 

The conditions used for constructing the efficient FALLS intersection algorithm can also
be used to construct a boolean function $B_{F_1 \cap F_2}$ defined as:

$$B_{F_1 \cap F_2} = \begin{cases} \text{TRUE} & \text{if } FI_{F_1 \cap F_2} \ne \emptyset \\ \text{FALSE} & \text{otherwise} \end{cases}$$

In evaluating the function we use the parameters of the two FALLS. Intuitively, the function
checks for the existence of at least one pair $(i_1, i_2)$ satisfying the intersection conditions.
Due to lack of space we are unable to present more details of this boolean function. As we
will see later, it plays an important role in computing PITFALLS intersection.

2.3 Processor Index Tagged FAmiLy of Line Segments (PITFALLS)

Returning to the problem of redistribution, we can now see that a possible method could be 
to first construct a FALLS representation for each source processor based on the source data 
distribution and for each target processor based on the target data distribution. Next, we 
could iterate over all source-target pairs and determine the data to be sent between them 
using the FALLS intersection algorithm described above. However, this may not be very 
efficient in many cases since there may be many source-target processor pairs that do not
communicate. Figure 12 shows a redistribution of a 32-element array from a BLOCK
distribution to a BLOCKCYCLIC(2) distribution. Here we assume there are 4 sending
processors and 8 receiving processors. If we consider the edges to represent communication,
we can see that there will be no communication, for instance, between processor 0 on the
sending end and processor 5 on the receiving end. In order to avoid this unnecessary iteration,
we extend the FALLS representation to what is called the PITFALLS representation. A
PITFALLS $P$ is defined by a tuple $(l, r, str, num, disp, proc)$. We can see that we have two
new parameters, $disp$ and $proc$, as compared to the FALLS representation. Intuitively, a
PITFALLS represents a set of equally spaced FALLS for a set of $proc$ processors with the
spacing between the $l$'s of successive processor FALLS being $disp$. Formally, the $p$th FALLS
($0 \le p \le proc - 1$) of a PITFALLS $P = (l, r, str, num, disp, proc)$ (denoted by $F_p$ and called
the $p$th member of $P$) is given by:

$$F_p = (l + p \times disp,\; r + p \times disp,\; str,\; num)$$

Figure 12: Example of Redistribution

The advantage of using PITFALLS is that we do not use a separate set of FALLS for 
each processor; instead, one set of PITFALLS is used for the entire set of processors across 
which an array is distributed. The PITFALLS representation is parameterized by the IDs 
of the processors. Thus, given an ID, we can determine the FALLS representation for the 
associated processor. Examples of PITFALLS for a few regular distributions of a 32 element 
array are shown in Figure 13. It can be shown easily that no more than three PITFALLS 
are needed to represent any regular distribution of an array. 
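
Recovering one processor's FALLS from a PITFALLS is a one-line computation. The sketch below assumes the tuple order $(l, r, str, num, disp, proc)$ for a PITFALLS and $(l, r, str, num)$ for a FALLS, as in Section 2.2; the function name is hypothetical, and the BLOCK example in the usage note is derived by hand rather than taken from a figure.

```python
def pitfalls_member(pitfalls, p):
    """FALLS (l, r, str, num) of processor p described by a PITFALLS
    (l, r, str, num, disp, proc)."""
    l, r, stride, num, disp, nproc = pitfalls
    if not 0 <= p < nproc:
        raise IndexError("processor index out of range")
    return (l + p * disp, r + p * disp, stride, num)

def pitfalls_members(pitfalls):
    """FALLS of every processor covered by the PITFALLS, indexed by ID."""
    return [pitfalls_member(pitfalls, p) for p in range(pitfalls[5])]
```

For a 32-element array on four processors, BLOCKCYCLIC(2) corresponds to `(0, 1, 8, 4, 2, 4)`, and BLOCK would correspond to `(0, 7, 1, 1, 8, 4)` under the same conventions.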

BLOCK
    FALLS: P0 (0, 7, 1, 1); P1 (8, 15, 1, 1); P2 (16, 23, 1, 1); P3 (24, 31, 1, 1)

BLOCKCYCLIC(2)
    FALLS: P0 (0, 1, 8, 4)
    PITFALLS: (0, 1, 8, 4, 2, 4), where disp = 2

Figure 13: Examples of PITFALLS

As mentioned before, for redistribution, we are interested in being able to perform intersections
on our representation. The advantage of the PITFALLS representation is that it not
only helps us perform efficient intersection of data sets for a pair of processors, but also helps
us determine which pairs of processors will have a non-empty intersection. Consider a pair
of PITFALLS $P_1 = (l_1, r_1, str_1, num_1, disp_1, proc_1)$ and $P_2 = (l_2, r_2, str_2, num_2, disp_2, proc_2)$.
We can now write down the FALLS representation for a pair of members $(p_1, p_2)$ from the
two PITFALLS as:

$$F_{p_1} = (l_1 + p_1 \times disp_1,\; r_1 + p_1 \times disp_1,\; str_1,\; num_1)$$

$$F_{p_2} = (l_2 + p_2 \times disp_2,\; r_2 + p_2 \times disp_2,\; str_2,\; num_2)$$

We have previously defined a boolean function to determine whether a pair of FALLS
will have a non-empty intersection. We can now use this function to determine whether the
pair of FALLS $(F_{p_1}, F_{p_2})$ will intersect. Hence, we can decide whether the pair of processors
$(p_1, p_2)$ will need to communicate during redistribution. This is the basis for the PITFALLS
intersection algorithm shown in Figure 14.

for p_1 = 0, proc_1 - 1
    F_{p_1} = (l_1 + p_1 × disp_1, r_1 + p_1 × disp_1, str_1, num_1)
    for p_2 = 0, proc_2 - 1
        F_{p_2} = (l_2 + p_2 × disp_2, r_2 + p_2 × disp_2, str_2, num_2)
        if B_{F_{p_1} ∩ F_{p_2}} == TRUE
            compute FI_{F_{p_1} ∩ F_{p_2}}

Figure 14: PITFALLS Intersection Algorithm
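
The processor-pair loop can be sketched in Python as follows. Since the details of the boolean function are not given, the sketch substitutes a simple member-by-member test (`falls_overlap`), which is an assumption made for illustration and not the paper's function; all names are hypothetical. The example reproduces the BLOCK to BLOCKCYCLIC(2) redistribution of a 32-element array from 4 to 8 processors, with PITFALLS tuples derived by hand.

```python
def pitfalls_member(pitfalls, p):
    """FALLS (l, r, str, num) of processor p from (l, r, str, num, disp, proc)."""
    l, r, stride, num, disp, _ = pitfalls
    return (l + p * disp, r + p * disp, stride, num)

def segments(falls):
    """Inclusive (low, high) segments of a FALLS."""
    l, r, stride, num = falls
    return [(l + i * stride, r + i * stride) for i in range(num)]

def falls_overlap(f1, f2):
    """Stand-in for the boolean function: True if any two members intersect."""
    return any(max(a[0], b[0]) <= min(a[1], b[1])
               for a in segments(f1) for b in segments(f2))

def communicating_pairs(src, tgt):
    """All (source, target) processor pairs whose data sets intersect."""
    pairs = []
    for p1 in range(src[5]):
        f1 = pitfalls_member(src, p1)
        for p2 in range(tgt[5]):
            if falls_overlap(f1, pitfalls_member(tgt, p2)):
                pairs.append((p1, p2))
    return pairs

# BLOCK on 4 processors and BLOCKCYCLIC(2) on 8 processors, 32 elements.
pairs = communicating_pairs((0, 7, 1, 1, 8, 4), (0, 1, 16, 2, 2, 8))
```

Only 16 of the 32 possible processor pairs communicate in this example, and source processor 0 never sends to target processor 5.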

2.4 Multi-dimensional Array Redistribution 

Until this point we have only considered a one-dimensional case for all our representations 
and algorithms. Extending these to the multi-dimensional case is trivial and can be done 
by simply looking at the representations for each dimension and performing intersections 
on them independently. An example of the two-dimensional case is provided in Figure 15. 
Here, we show the FALLS representation for processor 1 of a 4 x 2 processor grid for two
given distributions. The first has the array distributed in a (BLOCK, CYCLIC) manner;
the second has it distributed in a (CYCLIC, BLOCKCYCLIC(4)) manner. We can see
that the FALLS representation for each dimension is independent of the others. Our multi-dimensional
FALLS intersection algorithm is shown in Figure 16.

After performing the dimension-by-dimension intersection, we can see that the set of 
Figure 15: Examples of FALLS for Multidimensional Arrays


for dim = 0, NumDims - 1
    FI^dim_{F_1 ∩ F_2} = ∅
    I_1^dim = max(0, ⌈(l_2^dim - r_1^dim) / str_1^dim⌉)
    for i_1 = I_1^dim, min(I_1^dim + m_1^dim - 1, num_1^dim - 1)
        L_1 = (l_1^dim + i_1 × str_1^dim, r_1^dim + i_1 × str_1^dim)
        for i_2 = max(0, ⌈(i_1 × str_1^dim + l_1^dim - r_2^dim) / str_2^dim⌉), min(⌊(i_1 × str_1^dim + r_1^dim - l_2^dim) / str_2^dim⌋, m_2^dim - 1, num_2^dim - 1)
            L_2 = (l_2^dim + i_2 × str_2^dim, r_2^dim + i_2 × str_2^dim)
            l = LOW(LI_{L_1 ∩ L_2})
            r = HIGH(LI_{L_1 ∩ L_2})
            str = FP^dim_{F_1 ∩ F_2}
            num = min(⌊(num_1^dim - i_1 - 1) / m_1^dim⌋, ⌊(num_2^dim - i_2 - 1) / m_2^dim⌋) + 1
            FI^dim_{F_1 ∩ F_2} = FI^dim_{F_1 ∩ F_2} ∪ (l, r, str, num)

Figure 16: Multi-dimensional FALLS Intersection Algorithm
Source FALLS (columns): (1, 1, 2, 16)
Target FALLS (columns): (4, 7, 8, 4)
Intersection: (5, 5, 8, 4) ∪ (7, 7, 8, 4)

Figure 17: Example of Multidimensional FALLS Intersection


resulting FALLS represent the set of indices in each of the dimensions that need to be 
transferred. Thus, during the transfer we have to consider rectangular sections of elements 
by combining the sets of indices for all dimensions. This is made clear in the example of a 
two-dimensional FALLS intersection shown in Figure 17. The source and target distributions 
for this redistribution are assumed to be the ones shown in Figure 15. As we can see, the 
intersection along the rows indicates all elements of row 0 are to be sent from source processor 
1 to target processor 1. The intersection along the columns indicates all elements in columns 
5, 7, 13, 15, 21, 23, 29, 31 are to be sent between these processors. Combining the two sets of 
indices gives us the set of elements (0, 5), (0, 7), (0, 13), (0, 15), (0,21), (0, 23), (0, 29), (0, 31). 
We indicate these via the shaded areas. 

Our multi-dimensional algorithm scales linearly with the number of dimensions involved. 
This is a significant advantage over the methods of [25]. 

A similar approach as the one above is used for the PITFALLS intersection in multiple 
dimensions. We consider the PITFALLS representation for source and target processor sets 
in each dimension and perform the PITFALLS intersection for that dimension using our 
algorithm of Figure 14. Later, we combine the results of these intersections and obtain 
rectangular sections of data that need to be transferred. 
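
Combining the per-dimension results into the elements to transfer amounts to a Cartesian product of the index sets. The sketch below illustrates this with the two-dimensional example discussed above; the function names are hypothetical, and representing row 0 as the single-member FALLS `(0, 0, 1, 1)` is an assumption made for the example.

```python
from itertools import product

def expand(falls):
    """All indices covered by one FALLS (l, r, str, num)."""
    l, r, stride, num = falls
    return [i for k in range(num) for i in range(l + k * stride, r + k * stride + 1)]

def elements_to_transfer(per_dim_falls):
    """per_dim_falls holds, for each dimension, the list of FALLS produced by
    the intersection in that dimension. The result is the list of index
    tuples formed by combining the index sets of all dimensions."""
    axes = [sorted(i for f in falls_list for i in expand(f))
            for falls_list in per_dim_falls]
    return list(product(*axes))

rows = [(0, 0, 1, 1)]                  # row 0 only
cols = [(5, 5, 8, 4), (7, 7, 8, 4)]    # columns 5, 7, 13, 15, 21, 23, 29, 31
elems = elements_to_transfer([rows, cols])
```

Because each dimension is intersected once and the results are only combined at the end, the analysis cost grows with the number of dimensions rather than with the number of elements.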

2.5 Multi-array Redistribution 

When multiple arrays are redistributed from one processor set to another, we
pack all the data to be transferred between a pair of processors, for all the arrays, into a single
buffer before sending its contents. This way, we ensure that no more than one message is
sent between any pair of processors even though they may communicate data for more than one array.
The advantage is that we have one long message instead of multiple short messages.
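
A minimal sketch of this packing step, in Python, is given below. It assumes that the per-array transfer sets have already been computed and are available as a mapping from (array name, destination processor) to a list of local indices. The function pack_for_destinations and the sample data are hypothetical; the sketch shows only the buffer layout, not the actual communication calls.

```python
from collections import defaultdict

def pack_for_destinations(arrays, transfers):
    """Build one contiguous buffer per destination processor.

    arrays:    {name: list of local values}
    transfers: {(name, dest): [local indices to send]}  (assumed precomputed)

    Arrays are packed in sorted name order so that a receiver applying the
    same order can unpack the buffer without extra headers.
    """
    buffers = defaultdict(list)
    for (name, dest) in sorted(transfers):
        buffers[dest].extend(arrays[name][i] for i in transfers[(name, dest)])
    return dict(buffers)

# Hypothetical data: two arrays, two destination processors.
arrays = {"A": [10, 11, 12, 13], "B": [20, 21, 22, 23]}
transfers = {("A", 1): [0, 2], ("B", 1): [3], ("A", 2): [1], ("B", 2): [0, 1]}

buffers = pack_for_destinations(arrays, transfers)
# One message per destination instead of one per (array, destination) pair.
```

In this example four (array, destination) transfers collapse into two messages, one for each destination processor.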


2.6 Summary 

To summarize, in this section we developed the idea of a representation for regular distri- 
butions of a multi-dimensional array over a set of processors. We also provided an efficient 
algorithm based on this representation to compute the data transfer required for a redistri- 
bution. In the next section, we briefly discuss the implementation of our ideas in the context 
of the PARADIGM compiler and provide the results of a comparison of our technique with 
the runtime resolution method. 

3 Implementation and Results 

As mentioned earlier, the work proposed in this paper is aimed at supporting Multiple 
Program Multiple Data (MPMD) program generation in the PARADIGM compiler. Since 
the PARADIGM compiler is still under design and implementation, we tested our methods 
by automatically generating a set of functions for any given redistribution and executing 
these functions on the Intel Paragon and CM-5. In order to generate these functions, we
look at the source and target distributions of the array(s) being redistributed and generate 
the PITFALLS representation for both. Based on these representations and our PITFALLS 
intersection algorithm, we generate a pair of functions called GroupSend and GroupRecv to 
be executed by each of the source and target processors respectively. For the purposes of 
PITFALLS generation and intersection, we assume the processor subsystems are a contiguous 
block; i.e., if there are p processors in a subsystem, we number them 0 through p - 1. We
call these the virtual IDs for the processors and provide structures in the GroupSend and 
GroupRecv functions for any processor to determine its virtual ID based on its real ID. 
When an actual SEND or RECV is performed by any processor, it needs to remap virtual 
IDs to real IDs using the same structures. The structures basically specify the IDs of the 
processors involved in the source and target processor sets. The idea behind having such 
structures is very similar to the concepts of Groups, Contexts and Communicators described 
in the MPI standard [20]. Because reliable implementations of MPI were not available on
the machines on which we tested our methods, we chose to use our own interface. However, we can
easily modify our code to use MPI when reliable implementations become available.
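
The virtual-to-real ID mapping described above can be pictured with a small Python sketch. The class ProcessorGroup and the processor IDs used below are hypothetical and chosen only for illustration; the generated GroupSend and GroupRecv functions are not reproduced in this paper, so this is one possible shape for such a structure rather than the one actually generated.

```python
class ProcessorGroup:
    """Maps between real processor IDs and virtual IDs 0 .. p-1."""

    def __init__(self, real_ids):
        # Position in the list defines the virtual ID.
        self.real_ids = list(real_ids)
        self._virtual = {rid: vid for vid, rid in enumerate(self.real_ids)}

    def virtual_id(self, real_id):
        """Virtual ID used during PITFALLS generation and intersection."""
        return self._virtual[real_id]

    def real_id(self, virtual_id):
        """Real ID needed when an actual SEND or RECV is issued."""
        return self.real_ids[virtual_id]

# Hypothetical source and target processor sets.
source = ProcessorGroup([4, 5, 6, 7, 12, 13])
target = ProcessorGroup([0, 1, 2])

my_virtual = source.virtual_id(12)   # a source processor looks up its own virtual ID
dest_real = target.real_id(2)        # and remaps a virtual destination to a real ID
```

Keeping the analysis in terms of virtual IDs means the intersection results do not depend on which physical processors make up the source and target sets.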

We generated and timed our algorithm for a total of 27 redistributions using all possible 
combinations of: 
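Ds            Dt        Size       Ps      Pt      Naive (ms)  PITFALLS (ms)  Speedup
BC(3),B       C,BC(5)   128 x 128  4 x 4   3 x 5   23.91       10.98          2.18
BC(3),B       C,BC(5)   128 x 128  2 x 6   3 x 3   32.45       11.30          2.87
BC(3),B       C,BC(5)   128 x 128  3 x 5   4 x 3   26.28       10.75          2.45
BC(3),B       C,BC(5)   256 x 256  4 x 4   3 x 5   87.02       35.70          2.44
BC(3),B       C,BC(5)   256 x 256  2 x 6   3 x 3   125.72      44.64          2.81
BC(3),B       C,BC(5)   256 x 256  3 x 5   4 x 3   100.00      40.20          2.49
BC(3),B       C,BC(5)   512 x 512  4 x 4   3 x 5   344.16      130.92         2.63
BC(3),B       C,BC(5)   512 x 512  2 x 6   3 x 3   520.48      168.56         3.08
BC(3),B       C,BC(5)   512 x 512  3 x 5   4 x 3   396.89      149.38         2.66
BC(3),BC(7)   BC(5),C   128 x 128  5 x 2   4 x 3   36.94       16.56          2.23
BC(3),BC(7)   BC(5),C   128 x 128  3 x 6   5 x 2   33.40       13.10          2.55
BC(3),BC(7)   BC(5),C   128 x 128  4 x 5   3 x 3   34.27       15.86          2.16
BC(3),BC(7)   BC(5),C   256 x 256  5 x 2   4 x 3   140.12      62.58          2.24
BC(3),BC(7)   BC(5),C   256 x 256  3 x 6   5 x 2   125.23      44.87          2.79
BC(3),BC(7)   BC(5),C   256 x 256  4 x 5   3 x 3   126.07      47.25          2.67
BC(3),BC(7)   BC(5),C   512 x 512  5 x 2   4 x 3   573.66      231.96         2.47
BC(3),BC(7)   BC(5),C   512 x 512  3 x 6   5 x 2   498.43      173.64         2.87
BC(3),BC(7)   BC(5),C   512 x 512  4 x 5   3 x 3   501.74      174.73         2.87
B,A           A,B       128 x 128  8 x 1   1 x 16  29.62       9.67           3.06
B,A           A,B       128 x 128  16 x 1  1 x 16  21.77       8.16           2.67
B,A           A,B       128 x 128  10 x 1  1 x 18  26.51       9.08           2.92
B,A           A,B       256 x 256  8 x 1   1 x 16  109.51      33.60          3.26
B,A           A,B       256 x 256  16 x 1  1 x 16  75.84       22.46          3.38
B,A           A,B       256 x 256  10 x 1  1 x 18  95.49       29.96          3.19
B,A           A,B       512 x 512  8 x 1   1 x 16  441.90      129.96         3.40
B,A           A,B       512 x 512  16 x 1  1 x 16  299.52      80.04          3.74
B,A           A,B       512 x 512  10 x 1  1 x 18  372.34      115.43         3.23

Table 1: Results on the Thinking Machines CM-5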
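Ds            Dt        Size       Ps      Pt           Naive (ms)   PITFALLS (ms)  Speedup
BC(3),B       C,BC(5)   128 x 128  4 x 4   [illegible]  21.47        12.56          1.71
BC(3),B       C,BC(5)   128 x 128  2 x 6   [illegible]  [illegible]  10.43          2.52
BC(3),B       C,BC(5)   128 x 128  3 x 5   4 x 3        22.56        11.62          1.94
BC(3),B       C,BC(5)   256 x 256  4 x 4   3 x 5        66.42        24.84          2.67
BC(3),B       C,BC(5)   256 x 256  2 x 6   3 x 3        95.02        30.64          [illegible]
BC(3),B       C,BC(5)   256 x 256  3 x 5   4 x 3        75.01        27.80          [illegible]
BC(3),B       C,BC(5)   512 x 512  4 x 4   [illegible]  239.60       102.05         2.35
BC(3),B       C,BC(5)   512 x 512  2 x 6   [illegible]  364.69       132.93         2.74
BC(3),B       C,BC(5)   512 x 512  3 x 5   4 x 3        289.66       101.17         2.86
BC(3),BC(7)   BC(5),C   128 x 128  5 x 2   4 x 3        [illegible]  11.97          2.13
BC(3),BC(7)   BC(5),C   128 x 128  3 x 6   [illegible]  [illegible]  12.90          [illegible]
BC(3),BC(7)   BC(5),C   128 x 128  4 x 5   [illegible]  [illegible]  14.60          1.81
BC(3),BC(7)   BC(5),C   256 x 256  5 x 2   4 x 3        [illegible]  33.32          2.75
BC(3),BC(7)   BC(5),C   256 x 256  3 x 6   5 x 2        85.97        29.11          2.95
BC(3),BC(7)   BC(5),C   256 x 256  4 x 5   3 x 3        87.23        32.64          [illegible]
BC(3),BC(7)   BC(5),C   512 x 512  5 x 2   4 x 3        354.83       130.67         2.71
BC(3),BC(7)   BC(5),C   512 x 512  3 x 6   5 x 2        317.44       109.84         2.89
BC(3),BC(7)   BC(5),C   512 x 512  4 x 5   [illegible]  331.56       112.21         2.96
B,A           A,B       128 x 128  8 x 1   [illegible]  24.15        [illegible]    2.81
B,A           A,B       128 x 128  16 x 1  1 x 16       20.42        [illegible]    2.17
B,A           A,B       128 x 128  10 x 1  1 x 18       22.23        8.55           2.60
B,A           A,B       256 x 256  8 x 1   1 x 16       88.36        21.05          4.20
B,A           A,B       256 x 256  16 x 1  1 x 16       67.71        18.63          3.63
B,A           A,B       256 x 256  10 x 1  [illegible]  74.38        19.28          3.86
B,A           A,B       512 x 512  8 x 1   1 x 16       336.76       75.30          4.47
B,A           A,B       512 x 512  16 x 1  1 x 16       [illegible]  [illegible]    3.85
B,A           A,B       512 x 512  10 x 1  [illegible]  [illegible]  62.83          4.56

Table 2: Results on the Intel Paragon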
1. Three source-target distribution pairs:

(a) BC(3), B to C, BC(5)

(b) BC(3), BC(7) to BC(5), C

(c) B, A to A, B

where A = ALL, B = BLOCK, C = CYCLIC and BC(x) = BLOCKCYCLIC(x).

2. Three array sizes:

(a) 128 x 128

(b) 256 x 256

(c) 512 x 512

3. Three processor grids chosen independently for each distribution pair (shown along with
the results in Tables 1 and 2).

The distributions and processor grids were chosen to show that our method can work 
well for all of the regular distributions on arbitrary processor sets. 
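
The count of 27 follows from 3 distribution pairs x 3 array sizes x 3 processor-grid pairs per distribution pair. The short Python sketch below enumerates the cases; the grid pairs are those listed in the Ps and Pt columns of Table 1, while the names DIST_PAIRS, SIZES and enumerate_cases are hypothetical and not part of the generated code.

```python
from itertools import product

# Each distribution pair (Ds, Dt) has its own three (Ps, Pt) grid pairs,
# taken from the Ps and Pt columns of Table 1.
DIST_PAIRS = {
    ("BC(3),B", "C,BC(5)"): [("4x4", "3x5"), ("2x6", "3x3"), ("3x5", "4x3")],
    ("BC(3),BC(7)", "BC(5),C"): [("5x2", "4x3"), ("3x6", "5x2"), ("4x5", "3x3")],
    ("B,A", "A,B"): [("8x1", "1x16"), ("16x1", "1x16"), ("10x1", "1x18")],
}
SIZES = [(128, 128), (256, 256), (512, 512)]

def enumerate_cases():
    """Return one (Ds, Dt, size, Ps, Pt) tuple per timed redistribution."""
    cases = []
    for (ds, dt), grids in DIST_PAIRS.items():
        for size, (ps, pt) in product(SIZES, grids):
            cases.append((ds, dt, size, ps, pt))
    return cases

cases = enumerate_cases()
```

Note that the source and target grids generally contain different numbers of processors (for example 4x4 to 3x5), which is the arbitrary-processor-set case the experiments are meant to exercise.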

To evaluate the effectiveness of our algorithm, we also implemented a runtime resolution
algorithm (referred to in the tables as Naive) and timed it for the same set of redistributions
(the details of such an algorithm were discussed in Section 1). The results of our study are
tabulated in Tables 1 and 2. From these, we can make the following observations:

• Our algorithm performs better than the runtime resolution algorithm in all cases. In a
couple of cases for the smallest array, the performance improvement is not great; we
attribute this to the fact that the elements being transferred between processors in these
cases are widely scattered rather than grouped in clustered sections, which makes addressing
them very expensive.

• The performance improvement becomes more appreciable as the array size increases. 
This means it is vital to use an efficient technique like ours for large array redistribu- 
tions. 

• It was of interest to compare the per element cost for the two methods as a function
of array size for a particular redistribution. For this purpose, we selected the
redistribution in which our method performs closest to the runtime resolution method for the
smallest array. The values computed are tabulated in Table 3. From this table we can
see that the per element costs drop very rapidly for our method as compared to the
runtime resolution method. This indicates that the overhead factor of our method is
very small for large arrays.

System                  Array Size  Naive (μs)  PITFALLS (μs)
Intel Paragon           128 x 128   1.61        1.45
Intel Paragon           256 x 256   1.33        0.76
Intel Paragon           512 x 512   1.26        0.60
Thinking Machines CM-5  128 x 128   2.03        1.77
Thinking Machines CM-5  256 x 256   1.91        1.04
Thinking Machines CM-5  512 x 512   1.90        0.78

Table 3: Comparison of Per Element Costs

• The improvement seems to be independent of the underlying machine. Both machines 
seem to show the same order of improvement. 
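
The trend in Table 3 can be made explicit by taking the ratio of the two per element costs at each array size. The Python sketch below does only that arithmetic on the values printed in Table 3; the names TABLE3 and cost_ratios are hypothetical, and no timing data beyond the table is assumed.

```python
# Per element costs from Table 3: {system: {N: (naive, pitfalls)}} for N x N arrays.
TABLE3 = {
    "Intel Paragon": {128: (1.61, 1.45), 256: (1.33, 0.76), 512: (1.26, 0.60)},
    "Thinking Machines CM-5": {128: (2.03, 1.77), 256: (1.91, 1.04), 512: (1.90, 0.78)},
}

def cost_ratios(table):
    """Ratio of Naive to PITFALLS per element cost, rounded to two places."""
    return {
        system: {n: round(naive / pitfalls, 2) for n, (naive, pitfalls) in rows.items()}
        for system, rows in table.items()
    }

ratios = cost_ratios(TABLE3)
for system, by_size in ratios.items():
    print(system, by_size)
```

On both machines the ratio grows from roughly 1.1 at 128 x 128 to more than 2 at 512 x 512, because the Naive per element cost stays nearly flat while the PITFALLS per element cost falls by more than half.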

4 Conclusions and Future Work 

In this paper we have described a technique for carrying out array redistribution in an efficient 
manner. Our technique relies on a simple yet effective representation (PITFALLS) for regular 
distribution of arrays. This representation makes the communication analysis required for 
redistribution very simple and efficient. The results we provide show that our method is
substantially faster than naive runtime resolution methods. The factor of improvement
achieved is higher for large array sizes, making it critical to use an efficient technique like 
ours for redistribution. 

We are currently exploring the possibility of using the PITFALLS representation for 
general communication analysis in the PARADIGM compiler. We would also like to consider 
redistribution of sections of an array and not the entire array. Such redistributions are 
needed sometimes at procedure boundaries if the procedure called modifies only a section of 
the input array. We are also going to undertake a more thorough analysis of the overheads 
of our method and look into the possibility of reducing these overheads further. 